We compare some methods recently used in the literature to detect the
existence of a certain degree of common behavior of stock returns belonging
to the same economic sector. Specifically, we discuss methods based on random
matrix theory and hierarchical clustering techniques. We apply these
methods to a portfolio of stocks traded at the London Stock Exchange. The
investigated time series are recorded both at a daily time horizon and at
a 5-minute time horizon. The correlation coefficient matrix is very different
at different time horizons, confirming that more structured correlation
coefficient matrices are observed for long time horizons. All the considered
methods are able to detect economic information and the presence of
clusters characterized by the economic sector of stocks. However, different
methods present a different degree of sensitivity with respect to different
sectors. Our comparative analysis suggests that the application of just a
single method may not be able to extract all the economic information
present in the correlation coefficient matrix of a stock portfolio.

PACS numbers: 89.75.Fb Structures and organization in complex systems,
89.75.Hc Networks and genealogical trees, 89.65.Gh Economics; econophysics,
financial markets, business and management

1. Introduction 

Multivariate time series are detected and recorded both in experiments
and in the monitoring of a wide number of physical, biological and economic
systems. A first instrument in the investigation of a multivariate time series
is the correlation matrix. The study of the properties of the correlation
matrix has a direct relevance in the investigation of mesoscopic physical
systems, high energy physics [2], information theory and communication,
the investigation of microarray data in biological systems, and
econophysics.



Multivariate stock return time series are characterized by a correlation
matrix which carries information about the economic sectors of the
considered stocks.

Recent empirical and theoretical analyses have shown that this information
can be detected by using a variety of methods. In this paper we review
some of these methods based on Random Matrix Theory (RMT),
correlation based clustering [11], and topological properties of correlation
based graphs [25]. The common and different aspects of these methods are
discussed by considering the results of an analysis investigating the set of
n = 92 stocks belonging to "SET 1" of the London Stock Exchange (LSE).
The time series cover the entire year 2002 and the analysis
is performed at two different time horizons. Specifically, we investigate the
5-minute time horizon and the daily time horizon to show the differences
detected in the structure of the correlation matrix of high frequency and
daily returns.

The paper is organized as follows: in Section 2 we discuss the methods 
used to extract economic information from a correlation matrix of a stock 
portfolio by using concepts and tools of RMT and hierarchical clustering. 
The investigated correlation based clustering procedures are the single link- 
age and average linkage. We also consider a graph obtained by imposing 
the topological constraint of planarity during its construction along a well 
defined algorithmic procedure. This graph has been named by its authors
the Planar Maximally Filtered Graph (PMFG). In Section 3 we present the
empirical results obtained for daily returns of the 92 stocks belonging to
"SET 1" of the LSE recorded in 2002. Section 4 presents the empirical
results obtained for 5-minute returns of the same set of data. In Section 5
we draw our conclusions.



2. Methods 

In this section we review several methods used to select the part of the
content of the correlation coefficient matrix which is robust with respect to
statistical uncertainty and carries economic information.

The correlation coefficient between the time evolution of two stock return
time series is defined as

$$\rho_{ij}(\Delta t) = \frac{\langle r_i r_j \rangle - \langle r_i \rangle \langle r_j \rangle}{\sqrt{\left(\langle r_i^2 \rangle - \langle r_i \rangle^2\right)\left(\langle r_j^2 \rangle - \langle r_j \rangle^2\right)}}, \qquad i,j = 1,\dots,n \qquad (1)$$

where n is the number of stocks, i and j label the stocks, $r_i$ is the logarithmic
return defined by $r_i = \ln P_i(t) - \ln P_i(t - \Delta t)$, $P_i(t)$ is the value of the
stock price i at the trading time t and $\Delta t$ is the time horizon at which one
computes the returns. In this work the correlation coefficient is computed
between synchronous return time series. The correlation coefficient matrix
is an $n \times n$ matrix whose elements are the correlation coefficients $\rho_{ij}(\Delta t)$.
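
As an illustration of Eq. (1), the following minimal sketch computes logarithmic returns and the correlation coefficient matrix for a few synchronous price series. The function names and the toy prices are assumptions made for this example only; they are not the data or the code used in the paper.

```python
import math

def log_returns(prices):
    """r(t) = ln P(t) - ln P(t - dt) for prices sampled every dt."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]

def correlation(x, y):
    """Correlation coefficient of Eq. (1) for two synchronous return series."""
    T = len(x)
    mean_x, mean_y = sum(x) / T, sum(y) / T
    mean_xy = sum(a * b for a, b in zip(x, y)) / T
    var_x = sum(a * a for a in x) / T - mean_x ** 2
    var_y = sum(b * b for b in y) / T - mean_y ** 2
    return (mean_xy - mean_x * mean_y) / math.sqrt(var_x * var_y)

def correlation_matrix(price_series):
    """n x n matrix of correlation coefficients between the return series."""
    returns = [log_returns(p) for p in price_series]
    n = len(returns)
    return [[correlation(returns[i], returns[j]) for j in range(n)]
            for i in range(n)]

# Toy prices (made-up numbers, three hypothetical stocks, five records each).
prices = [
    [100.0, 101.0, 100.5, 102.0, 101.0],
    [50.0, 50.6, 50.2, 51.1, 50.5],
    [20.0, 19.8, 20.1, 19.7, 20.0],
]
rho = correlation_matrix(prices)
```

The resulting matrix is symmetric with unit diagonal, so only the n(n - 1)/2 off-diagonal elements carry information.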

We start our review of methods by discussing the application of concepts
of RMT which have been used to select the eigenvalues and eigenvectors
of the correlation matrix less affected by statistical uncertainty. Then we
consider two different correlation based clustering procedures. Correlation
based clustering procedures are used to obtain a reduced number of similarity
measures representative of the whole original correlation matrix. This
filtering typically reduces the $n(n-1)/2$ distinct elements of the matrix
to a number of similarity measures of the order of n. The first clustering
procedure we consider here is the single linkage clustering method, which has
been repeatedly used to detect a hierarchical organization of stocks and the
associated Minimum Spanning Tree (MST) and PMFG. The PMFG is a recently
introduced graph which extends the number of similarity measures associated
with the graph with respect to the ones present in the MST. This extension of
the considered links preserves the hierarchical tree of the MST [25]. The
second clustering procedure is the average linkage, which provides
a different taxonomy. The last method we consider is the PMFG.



2.1. Random Matrix Theory 

Random Matrix Theory was originally developed in nuclear physics
and then applied to many different fields. In the context of asset portfolio
management RMT is useful because it allows one to compute the effect of
statistical uncertainty in the estimation of the correlation matrix. Suppose
that the n assets are described by n time series of T time records and that
the returns are independent Gaussian random variables with zero mean and
variance $\sigma^2$. The correlation matrix of this set of variables in the limit
$T \to \infty$ is simply the identity matrix. When T is finite the correlation
matrix will in general be different from the identity matrix. RMT allows
one to prove that in the limit $T, n \to \infty$, with a fixed ratio $Q = T/n > 1$, the
eigenvalue spectral density of the covariance matrix is given by

$$\rho(\lambda) = \frac{Q}{2\pi\sigma^2} \frac{\sqrt{(\lambda_{max} - \lambda)(\lambda - \lambda_{min})}}{\lambda} \qquad (2)$$

where $\lambda_{min}^{max} = \sigma^2 (1 + 1/Q \pm 2\sqrt{1/Q})$. The spectral density is different from
zero in the interval $]\lambda_{min}, \lambda_{max}[$. In the case of a correlation matrix it is
$\sigma^2 = 1$. The spectrum described by Eq. (2) is different from $\delta(\lambda - 1)$, which is
expected for an identity correlation matrix. In other words, RMT quantifies
the role of the finiteness of the length of the time series on the spectral
properties of the correlation matrix.

RMT has been applied to the investigation of correlation matrices of
financial asset returns and it has been shown that the spectrum of a
typical portfolio can be divided in three classes of eigenvalues. The largest
eigenvalue is totally incompatible with Eq. (2) and describes the common
behavior of the stocks composing the portfolio. This fact leads to another
working hypothesis: the part of the correlation matrix which is orthogonal
to the eigenvector corresponding to the first eigenvalue is random. This
amounts to quantifying the variance of the part not explained by the highest
eigenvalue as $\sigma^2 = 1 - \lambda_1/n$ and using this value in Eq. (2) to compute $\lambda_{min}$
and $\lambda_{max}$. Under this assumption, previous studies have shown that a fraction
of the order of a few percent of the eigenvalues are also incompatible with
RMT because they fall outside the interval $]\lambda_{min}, \lambda_{max}[$ computed with
the value of $\sigma^2$ that takes into account the behavior of the first eigenvalue.
These eigenvalues probably describe economic information stored in the
correlation matrix. The remaining large part of the eigenvalues lies between $\lambda_{min}$
and $\lambda_{max}$ and thus one cannot say whether any information is contained in
the corresponding eigenspace.
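
The two thresholds discussed above can be sketched in a few lines. The function names are assumptions made for this illustration; the numerical inputs (Q = 2.71, largest eigenvalue 36.0, n = 92) are the values quoted for the daily data in Section 3.2, and since Q is a rounded value the results are only expected to agree with the reported thresholds to about two decimal places.

```python
import math

def rmt_bounds(Q, sigma2=1.0):
    """Edges of the RMT spectrum: sigma^2 * (1 + 1/Q -/+ 2*sqrt(1/Q))."""
    shift = 2.0 * math.sqrt(1.0 / Q)
    return (sigma2 * (1.0 + 1.0 / Q - shift),
            sigma2 * (1.0 + 1.0 / Q + shift))

def rmt_density(lam, Q, sigma2=1.0):
    """Spectral density of Eq. (2); zero outside ]lambda_min, lambda_max[."""
    lam_min, lam_max = rmt_bounds(Q, sigma2)
    if lam <= lam_min or lam >= lam_max:
        return 0.0
    return (Q / (2.0 * math.pi * sigma2)
            * math.sqrt((lam_max - lam) * (lam - lam_min)) / lam)

def residual_variance(lambda_1, n):
    """Variance not explained by the largest eigenvalue: 1 - lambda_1 / n."""
    return 1.0 - lambda_1 / n

# Values quoted in Section 3.2 for the daily returns.
Q, lambda_1, n = 2.71, 36.0, 92

_, lam_max_plain = rmt_bounds(Q)                     # sigma^2 = 1
_, lam_max_corrected = rmt_bounds(Q, residual_variance(lambda_1, n))
```

Eigenvalues above `lam_max_corrected` are the candidates for carrying economic information once the market-wide eigenvalue has been accounted for.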

The fact that by using RMT it is possible, under certain assumptions,
to identify the part of the correlation matrix containing economic information
suggested to some authors the use of RMT to show that some selected
eigenvectors, i.e. eigenvectors associated with eigenvalues not explained by
RMT, describe economic sectors. Specifically, the suggested method is
the following. One computes the correlation matrix and finds the spectrum,
ranking the eigenvalues such that $\lambda_k > \lambda_{k+1}$. The eigenvector corresponding
to $\lambda_k$ is denoted $u^k$. The set of investigated stocks is partitioned in S
sectors $s = 1, 2, \dots, S$ according to their economic activity (for example by
using classification codes such as the Standard Industrial Classification
code or Forbes). One then defines an $S \times n$ projection matrix P with
elements $P_{si} = 1/n_s$ if stock i belongs to sector s and $P_{si} = 0$ otherwise.
Here $n_s$ is the number of stocks belonging to sector s. For each eigenvector
$u^k$ one computes

$$X_s^k \equiv \sum_{i=1}^{n} P_{si} \, [u_i^k]^2 \qquad (3)$$

This number gives a measure of the role of a given sector s in explaining the
composition of eigenvector $u^k$. Thus when a given eigenvector has a large
value of $X_s^k$ for only one (or few) sectors s, one can conclude that the
eigenvector describes that economic sector. Note that this method requires the
a priori knowledge of the sector for each stock in order to be implemented.
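
A minimal sketch of the sector projection follows. It assumes that the contribution of sector s to an eigenvector is the sum over stocks of $P_{si}$ times the squared eigenvector component, with $P_{si} = 1/n_s$ for stocks in the sector, which is the usual form of this projection; the function name, the toy eigenvector and the sector labels are made up for the example.

```python
def sector_contributions(eigenvector, sector_of, n_sectors):
    """Average squared eigenvector component within each sector.

    eigenvector : list of components u_i, one per stock
    sector_of   : list giving the sector index (0-based) of each stock
    """
    counts = [0] * n_sectors
    sums = [0.0] * n_sectors
    for component, s in zip(eigenvector, sector_of):
        counts[s] += 1
        sums[s] += component * component
    # Sectors without stocks get a contribution of zero.
    return [sums[s] / counts[s] if counts[s] else 0.0
            for s in range(n_sectors)]

# Toy example: four stocks, two sectors, eigenvector dominated by sector 0.
u = [0.7, 0.7, 0.1, 0.1]
sectors = [0, 0, 1, 1]
X = sector_contributions(u, sectors, n_sectors=2)
```

A single dominant entry in `X`, as in this toy case, is the signature of an eigenvector that describes one economic sector.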

2.2. Hierarchical Clustering Methods 

Another approach used to detect the information associated with the
correlation matrix is given by correlation based hierarchical clustering
analysis. Consider a set of n objects and suppose that a similarity measure, e.g.
the correlation coefficient, between pairs of elements is defined. Similarity
measures can be written in an $n \times n$ similarity matrix. Hierarchical
clustering methods allow one to hierarchically organize the elements in clusters.
The result of the procedure is a rooted tree or dendrogram giving a quantitative
description of the clusters thus obtained. It is worth noting that
hierarchical clustering methods can as well be applied to distance matrices.

A large number of hierarchical clustering procedures can be found in the
literature. For a review about the classical techniques see for instance Ref.
(113). In this paper we focus our attention on the Single Linkage Cluster
Analysis (SLCA), which was introduced in finance in Ref. [11], and the
Average Linkage Cluster Analysis (ALCA).

2.2.1. Single Linkage Correlation Based Clustering 

The Single Linkage Cluster Analysis is a filtering procedure based on the
estimation of the subdominant ultrametric distance [28] associated with a
metric distance obtained from the correlation coefficient matrix of a set of n
stocks. This procedure, already used in other fields, allows one to extract a MST
and a hierarchical tree from a correlation coefficient matrix by means of a
well defined algorithm known as the nearest neighbor single linkage clustering
algorithm [29]. This methodology reveals both topological (through
the MST) and taxonomic (through the hierarchical tree) aspects of the
correlation present among stocks.

The MST is obtained by selecting a relevant part of the information
which is present in the correlation coefficient matrix of the time series of
stock returns. In the present study this is done (i) by determining the
synchronous correlation coefficient of the difference of logarithm of stock price
computed at a selected time horizon, (ii) by calculating a metric distance
between all the pairs of stocks and (iii) by selecting the subdominant
ultrametric distance associated with the considered metric distance. The
subdominant ultrametric is the ultrametric structure closest to the original metric
structure.

A metric distance between a pair of stocks can be rigorously determined
by defining

$$d_{ij} = \sqrt{2(1 - \rho_{ij})} \qquad (4)$$

With this choice $d_{ij}$ fulfills the three axioms of a metric: (i) $d_{ij} = 0$ if and
only if $i = j$; (ii) $d_{ij} = d_{ji}$ and (iii) $d_{ij} \leq d_{ik} + d_{kj}$. The distance matrix D
is then used to determine the MST connecting the n stocks.
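
The mapping of Eq. (4) is a one-liner; the sketch below (function names are assumptions of this example) also shows the limiting cases. The value 0.89 is the correlation quoted in Section 3.3 for the pair of stocks SHEL and BP, for which the paper reports a distance of 0.47.

```python
import math

def correlation_to_distance(rho):
    """Metric distance of Eq. (4): d = sqrt(2 * (1 - rho))."""
    return math.sqrt(2.0 * (1.0 - rho))

def distance_matrix(corr):
    """Apply Eq. (4) element by element to a correlation matrix."""
    return [[correlation_to_distance(r) for r in row] for row in corr]

d_perfect = correlation_to_distance(1.0)    # perfectly correlated pair
d_anti = correlation_to_distance(-1.0)      # perfectly anticorrelated pair
d_example = correlation_to_distance(0.89)   # correlation quoted in Sec. 3.3
```

Since the distance decreases monotonically as the correlation grows, ranking pairs by increasing distance is the same as ranking them by decreasing correlation.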

The MST is a graph without loops connecting all the n nodes with the
shortest $n - 1$ links amongst all the links between the nodes. The selection
of these $n - 1$ links is done according to some widespread algorithm
and can be summarized as follows:

1. Construct an ordered list of pairs of stocks $L_{ord}$ by ranking all the
possible pairs according to their distance $d_{ij}$. The first pair of $L_{ord}$
has the shortest distance.

2. The first pair of $L_{ord}$ gives the first two elements of the MST and the
link between them.

3. The construction of the MST continues by analyzing the list $L_{ord}$. At
each successive stage, a pair of elements is selected from $L_{ord}$ and the
corresponding link is added to the MST only if no loops are generated
in the graph after the link insertion.

Different elements of the list are therefore iteratively included in the MST
starting from the first two elements of $L_{ord}$. As a result, one obtains a graph
with n vertices and $n - 1$ links. For a didactic description of the method
used to obtain the MST one can consult Ref. (js^) 
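
The three steps listed above can be sketched as follows. This is a minimal illustration, not the authors' code: the loop check is done with a union-find structure, which is one common way (an assumption of this sketch) to test whether a new link would close a loop, and the 4 x 4 distance matrix is made up.

```python
def minimum_spanning_tree(dist):
    """Build the MST from a symmetric distance matrix.

    Returns a list of (i, j, distance) links in order of insertion.
    """
    n = len(dist)
    # Step 1: ordered list of all pairs, shortest distance first.
    ordered_pairs = sorted((dist[i][j], i, j)
                           for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))  # union-find forest used for the loop check

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    # Steps 2 and 3: scan the list and accept a link only if it joins
    # two nodes that are not yet connected (so no loop is created).
    for d, i, j in ordered_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j, d))
            if len(tree) == n - 1:
                break
    return tree

# Toy distance matrix for four hypothetical stocks.
toy_dist = [
    [0.0, 0.5, 1.2, 1.3],
    [0.5, 0.0, 1.0, 1.4],
    [1.2, 1.0, 0.0, 0.6],
    [1.3, 1.4, 0.6, 0.0],
]
mst = minimum_spanning_tree(toy_dist)
```

In this toy case the links (0, 1) and (2, 3) are accepted first, and the link (1, 2) then joins the two groups; the pairs (0, 2), (0, 3) and (1, 3) are never used.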

In Ref. (^S^) the procedure briefly sketched above has been shown to 
provide a MST which is associated to the same hierarchical tree of the 
SLCA. In this procedure, at each step, when two elements or one element
and a cluster or two clusters p and q merge in a wider single cluster t,
the distance $d_{tr}$ between the new cluster t and any cluster r is recursively
determined as follows:

$$d_{tr} = \min\{d_{pr}, d_{qr}\} \qquad (5)$$

thus indicating that the distance between any element of cluster t and any
element of cluster r is the shortest distance between any two entities in
clusters t and r. By applying this procedure iteratively, $n - 1$ of the $n(n-1)/2$
distinct elements of the original correlation coefficient matrix are selected.

The distance matrix obtained by applying the SLCA is an ultrametric
matrix comprising $n - 1$ distinct selected elements. The ultrametric distance
$d^{<}_{ij}$ between element i belonging to cluster t and element j belonging to
cluster r is therefore defined as the distance between clusters t and r.
Ultrametric distances $d^{<}_{ij}$ are distances satisfying the inequality
$d^{<}_{ij} \leq \max\{d^{<}_{ik}, d^{<}_{kj}\}$, which is stronger than the customary
triangular inequality $d_{ij} \leq d_{ik} + d_{kj}$ [28]. The
SLCA has an associated ultrametric correlation matrix, which is the
subdominant ultrametric matrix of the original correlation coefficient matrix.
It can be obtained starting from the ultrametric distances $d^{<}_{ij}$ and making
use of Eq. (4).

The MST allows one to obtain, in a direct and essentially unique way, the
subdominant ultrametric distance matrix $D^{<}$ and the hierarchical
organization of the elements of the investigated data set. In Ref. (|3^ it is proved
that the ultrametric correlation matrix obtained by the SLCA is always
positive definite when all the elements of the obtained ultrametric correlation
matrix are non-negative. This condition is rather common in financial data.

The effectiveness of the SLCA in pointing out the hierarchical structure
of the investigated portfolio has been shown by several studies.



2.2.2. Average Linkage Correlation Based Clustering 

The Average Linkage Cluster Analysis is a hierarchical clustering
procedure that can be described by considering either a similarity or a
distance measure. Here we consider the distance matrix D. The following
procedure performs the ALCA giving as an output a rooted tree and an
ultrametric matrix of elements $d^{<}_{ij}$:

1. Set T as the matrix of elements $t_{ij}$ such that T = D.

2. Select the minimum distance $t_{hk}$ in the distance matrix T. Note that
after the first step of construction h and k can be simple elements (i.e.
clusters of one element each) or clusters (sets of elements).

3. Merge cluster h and cluster k into a single cluster, say h. The merging
operation identifies a node in the rooted tree connecting clusters h and
k at the distance $t_{hk}$. Furthermore, to obtain the ultrametric matrix
it is sufficient that $\forall\, i \in h$ and $\forall\, j \in k$ one sets $d^{<}_{ij} = d^{<}_{ji} = t_{hk}$.

4. Redefine the matrix T:

$$t_{hj} = \frac{N_h t_{hj} + N_k t_{kj}}{N_h + N_k} \quad \text{if } j \neq h \text{ and } j \neq k; \qquad t_{ij} = t_{ij} \quad \text{otherwise,}$$

where $N_h$ and $N_k$ are the number of elements belonging respectively
to the cluster h and to the cluster k. Note that if the dimension of T
is $m \times m$ then the dimension of the redefined T is $(m - 1) \times (m - 1)$
because of the merging of clusters h and k into the cluster h.

5. If the dimension of T is bigger than one then go to step 2, else stop.

By replacing point 4 of the above algorithm with the following item

4. Redefine the matrix T:

$$t_{hj} = \min\{t_{hj}, t_{kj}\} \quad \text{if } j \neq h \text{ and } j \neq k; \qquad t_{ij} = t_{ij} \quad \text{otherwise,}$$

one obtains an algorithm performing the SLCA which is therefore equivalent
to the one described in the previous section. The algorithm can be easily
adapted for working with similarities instead of distances. It is enough
to exchange the distance matrix D with a similarity matrix (for instance
the correlation matrix) and replace the search for the minimum distance
in the matrix T in point 2 of the above algorithm with the search for the
maximal similarity.

It is worth noting that the ALCA can produce different hierarchical
trees depending on the use of a similarity matrix or a distance matrix.
More precisely, different dendrograms can result for the ALCA due to the
non-linearity of the transformation of Eq. (4). This problem does not arise
in the SLCA because Eq. (4) is a monotonic transformation and therefore it
does not affect the search for the minimum (or maximum for the similarity).
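
The five-step procedure, with either the average-linkage or the single-linkage version of step 4, can be sketched as below. This is a minimal illustration under stated assumptions: the function name, the dictionary-based bookkeeping of the matrix T and the toy distance matrix are choices made for this example, not the authors' implementation.

```python
def linkage_clustering(dist, method="average"):
    """Agglomerative clustering on a distance matrix.

    method = "average" uses the size-weighted mean in step 4 (ALCA),
    method = "single" uses the minimum instead (SLCA).
    Returns (merges, ultrametric), where merges lists
    (members_of_h, members_of_k, merge_distance).
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}           # cluster label -> elements
    t = {(i, j): dist[i][j]                        # step 1: T = D
         for i in range(n) for j in range(n) if i != j}
    ultrametric = [[0.0] * n for _ in range(n)]
    merges = []
    while len(members) > 1:                        # step 5
        # Step 2: closest pair of current clusters.
        h, k = min(((a, b) for a in members for b in members if a < b),
                   key=lambda pair: t[pair])
        d_hk = t[(h, k)]
        # Step 3: record the node and fill the ultrametric matrix.
        for i in members[h]:
            for j in members[k]:
                ultrametric[i][j] = ultrametric[j][i] = d_hk
        merges.append((list(members[h]), list(members[k]), d_hk))
        # Step 4: redefine the distances from the merged cluster h.
        n_h, n_k = len(members[h]), len(members[k])
        for j in members:
            if j in (h, k):
                continue
            if method == "average":
                new = (n_h * t[(h, j)] + n_k * t[(k, j)]) / (n_h + n_k)
            else:
                new = min(t[(h, j)], t[(k, j)])
            t[(h, j)] = t[(j, h)] = new
        members[h] = members[h] + members[k]
        del members[k]
    return merges, ultrametric

# Toy distance matrix for four hypothetical stocks.
toy_dist = [
    [0.0, 0.5, 1.2, 1.3],
    [0.5, 0.0, 1.0, 1.4],
    [1.2, 1.0, 0.0, 0.6],
    [1.3, 1.4, 0.6, 0.0],
]
merges_avg, ultra_avg = linkage_clustering(toy_dist, "average")
merges_sl, ultra_sl = linkage_clustering(toy_dist, "single")
```

In this toy case both methods merge the same pairs first, but the last merge happens at the average of the four inter-cluster distances for the ALCA and at the shortest of them for the SLCA, which is why the two ultrametric matrices differ.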



2.3. The Planar Maximally Filtered Graph

The Planar Maximally Filtered Graph has been introduced in a recent
paper [25]. The basic idea is to obtain a graph that retains the same
hierarchical properties of the MST, i.e. the same hierarchical tree of SLCA,
but allows a greater number of links and more complex topological
structures than the MST, such as loops and cliques. Such a graph is obtained
by relaxing the topological constraint of the MST construction protocol of
Section 2.2.1 according to which no loops are allowed in a tree. Specifically,
in the PMFG a link can be included in the graph if and only if the graph
with the new link included is still planar. A graph is planar if and only if
it can be drawn on a plane (infinite in principle) without edge crossings [37].

The first difference between MST and PMFG is about the number of
links, which is $n - 1$ in the MST and $3(n - 2)$ in the PMFG. Furthermore, loops
and cliques are allowed in the PMFG. A clique of r elements (r-clique) is a
subgraph of r elements where each element is linked to each other. Because
of Kuratowski's theorem only 3-cliques and 4-cliques are allowed in
the PMFG. The study of 3-cliques and 4-cliques is relevant for understanding
the strength of clusters in the system [25], as we will see below in the
empirical applications.

Concerning the hierarchical structure associated with the PMFG, it has
been shown in Ref. [25] that at any step of construction of the MST and
PMFG, if two elements are connected via at least one path in one of the
considered graphs, then they are also connected in the other one. This
statement implies that (i) the MST is always contained in the PMFG and
(ii) the hierarchical tree associated with both the MST and PMFG is the one
obtained from the SLCA.

In summary, the PMFG is a graph retaining more information about the
system than the MST, the information being stored in the included new
links and in the new topological structures allowed, i.e. loops and cliques.
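
To give a sense of how strongly the two graphs filter the correlation matrix, the link counts above can be evaluated for the portfolio size n = 92 used in this paper. The figures below are simple arithmetic from the formulas quoted in this section, not numbers reported by the authors.

```latex
\begin{align*}
\text{distinct pairs in the correlation matrix:} \quad & \frac{n(n-1)}{2} = \frac{92 \cdot 91}{2} = 4186 \\
\text{links retained by the MST:} \quad & n - 1 = 91 \\
\text{links retained by the PMFG:} \quad & 3(n-2) = 3 \cdot 90 = 270
\end{align*}
```

The PMFG therefore keeps roughly three times as many links as the MST while still discarding the large majority of the pairs.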



3. Empirical Results: Daily Data 

In the present section we apply the selected methods to a set of stocks 
traded at the LSE. These stocks are highly capitalized stocks and they 
belong to 11 different economic sectors. 

3.1. The Data Set 

We investigate the statistical properties of price returns for n = 92
highly traded stocks belonging to the SET1 segment of the LSE market
(www.londonstockexchange.com). In particular, we consider electronic
transactions that occurred in the year 2002. The empirical data are taken from
the "Rebuild Order Book" database, maintained by the LSE.

For each of the 92 stocks considered, the trading activity has been
defined in terms of the total number of transactions (electronic and manual)
that occurred in 2002. Most of the transactions, a mean value of 75% for the 92
stocks, are of the electronic type.

Table 1: Economic sectors of activity for 92 highly traded stocks belonging to the
SET1 segment of the LSE. The classification is done according to the methodology
used in the web-site www.euroland.com. The second column contains the economic
sector and the third column contains the number of stocks belonging to the sector.

       SECTOR                   NUMBER
   1   Technology                    4
   2   Financial                    20
   3   Energy                        3
   4   Consumer non-Cyclical        12
   5   Consumer Cyclical            10
   6   Healthcare                    6
   7   Basic Materials               5
   8   Services                     19
   9   Utilities                     6
  10   Capital Goods                 5
  11   Transportation                2

For each stock and for each trading day we consider the time series
of stock price recorded transaction by transaction. Since transactions for
different stocks do not happen simultaneously, we divide each trading day
(lasting 8h 30') into intervals of 5 minutes each. For each trading day, we
define 103 intraday stock price proxies $P_i(t_k)$, with $k = 1, \dots, 103$. The
proxy is defined as the transaction price detected nearest to the end of
the interval (this is one possible way to deal with high-frequency financial
data). By using these proxies, we compute the price returns $r_i =
\ln P_i(t) - \ln P_i(t - \Delta t)$ at time horizons of $\Delta t = 5$ minutes and $\Delta t$ equal
to one trading day. In the case of a daily time horizon the returns are
computed as the difference of the logarithms of the closing prices of each
successive trading day. In the case of $\Delta t = 5$ minutes, the returns are always
computed as the difference of the logarithms of prices which belong to the
same trading day.
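
The construction of the intraday price proxies can be sketched as follows. This is only one possible reading of the rule stated above: the sketch assumes a grid of 103 equally spaced times covering the trading day (102 intervals of 300 seconds, i.e. 8h 30') and takes, for each grid time, the price of the transaction closest in time. The function name, the tick format and the toy data are assumptions of this example.

```python
def price_proxies(ticks, day_start, interval=300, n_points=103):
    """Sample transaction prices on a regular intraday grid.

    ticks     : list of (time_in_seconds, price), one entry per transaction
    day_start : time in seconds of the first grid point
    Returns one price per grid time, taken from the nearest transaction.
    """
    proxies = []
    for k in range(n_points):
        grid_time = day_start + k * interval
        # Transaction whose time stamp is closest to the grid time.
        _, price = min(ticks, key=lambda tick: abs(tick[0] - grid_time))
        proxies.append(price)
    return proxies

# Toy example with three transactions and a three-point grid (0, 300, 600 s).
toy_ticks = [(0, 100.0), (290, 101.0), (620, 102.0)]
toy_proxies = price_proxies(toy_ticks, day_start=0, n_points=3)
```

With 103 proxies per day one obtains 102 five-minute returns per stock and per day, all computed within the same trading day as required above.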

To each of the 92 selected stocks an economic sector of activity can
be associated according to the classification scheme used in the web-site
www.euroland.com. The relevant economic sectors are reported in Table
1 together with the number of stocks belonging to each of them (third
column).

Fig. 1: Sector contribution $X_s^k$ of Eq. (3) for the first (a), second (b), third (c), sixth
(f), seventh (g), eighth (h) and ninth (i) eigenvectors of the correlation matrix of
daily returns of 92 LSE stocks. Panel d) shows the sector contribution for the linear
combination $(u^4 + u^5)/\sqrt{2}$ and panel e) for the linear combination
$(u^4 - u^5)/\sqrt{2}$. The order of sectors is the same as in Table 1.



3.2. Random Matrix Theory 

For a time horizon of one trading day the largest eigenvalue is $\lambda_1 = 36.0$,
clearly incompatible with RMT and suggesting a driving factor common
to all the stocks. This is usually interpreted to be the "market mode" as
described in widespread market models, such as the Capital Asset Pricing
Model. The analysis of the components of the corresponding eigenvector
confirms this interpretation. In fact the mean component of the first
eigenvector is 0.102 and the standard deviation is 0.022, showing that all the
stocks contribute in a similar way to the eigenvector $u^1$.

In our data $Q = T/n = 2.71$ and the threshold value $\lambda_{max}$ without
taking into account the first eigenvalue is $\lambda_{max} = 2.58$. This implies that
RMT considers as signal only the first two eigenvalues $\lambda_1$ and $\lambda_2 = 4.58$. On
the other hand, if we remove the contribution of the first eigenvalue with the
procedure discussed in Section 2.1 we get $\lambda_{max} = 1.57$, indicating that the
first 6 eigenvalues could contain economic information. This result shows
the importance of taking into account the role of the first eigenvalue.

Figure 1 shows the sector contribution $X_s^k$ of Eq. (3) for the first 9
eigenvectors. Panel (a) shows that
all the sectors contribute roughly in a similar way to the first eigenvector.
On the other hand, eigenvectors 2, 3, and 6 are characterized by one prominent
sector. Specifically, the second eigenvector shows a large contribution
from the sector Consumer non-Cyclical (s = 4), the third eigenvector has
a significant contribution from the Financial sector (s = 2), and the sixth
eigenvector shows a prominent peak for the stocks of the sector Healthcare
(s = 6). The fourth ($\lambda_4 = 1.79$) and fifth ($\lambda_5 = 1.72$) eigenvalues are very
close and a plot of $X_s^k$ for the corresponding eigenvectors shows two peaks
corresponding to the sectors Capital Goods and Technology. By following
a line of reasoning presented in Ref. (|l^ a possible explanation is that the
noise due to the measurement favors the mixing of these two groups. In
support of this hypothesis, in panel d) we show the sector contribution for
the linear combination $(u^4 + u^5)/\sqrt{2}$ and in panel e) for the linear
combination $(u^4 - u^5)/\sqrt{2}$.
Panel d) has a large peak for the sector Capital Goods (s = 10) and in
panel e) the peak is associated with the Technology sector (s = 1). Finally,
the seventh, eighth and ninth eigenvectors do not show significant peaks,
indicating that these are probably eigenvectors of eigenvalues strongly
affected by statistical uncertainty ("noise dressed"). It is interesting to note
that RMT, after subtracting the contribution of the largest eigenvalue
as described above, predicts that the first 6 eigenvalues are deviating, i.e.
are outside the noise region. This is the same number one obtains from the
analysis of the eigenvector components.



3.3. Single Linkage Correlation Based Clustering 

The results obtained by using the SLCA for the daily returns are summarized
in Fig. 2 and Fig. 3, which show the hierarchical tree and the MST,
respectively.

The hierarchical tree shows that there exists a significant level of
correlation in the market, and in some cases clustering can be observed. In
particular, the first two stocks on the left of Fig. 2, Shell (SHEL) and
British Petroleum (BP), belonging to the Energy sector, are linked together
at an ultrametric distance $d^{<} = 0.47$ corresponding to a correlation
coefficient as high as $\rho = 0.89$. However, the third stock belonging to the Energy
economic sector (stock 13), which is British Gas (BG), is not linked to the
other two but it is linked to stocks belonging to the Financial sector. We
focus here our attention on the two sectors with the largest number of stocks,
which are the Financial sector (s = 2) and the Services sector (s = 8).
Panel a) of Fig. 2 gives an example in which some of the stocks belonging
to the same economic sector, e.g. Financial, are clustered together. In fact,
a cluster including 10 stocks from position 3 to position 12 can be observed.
Panel b) of Fig. 2 gives an example of the opposite case in which stocks
belonging to the same economic sector, e.g. Services, are poorly clustered.
In fact, only two small clusters of two stocks are formed.

The MST confirms the above results. In Fig. 3 the stocks belonging to



MantegnaKrakow printed on February 2, 2008 







Fig. 2: Hierarchical tree obtained by using the SLCA starting from the daily price
returns of 92 highly traded stocks belonging to the SET1 segment of the LSE.
Only electronic transactions that occurred in the year 2002 are considered. In
panel a) the Financial economic sector is highlighted. In panel b) the Services
economic sector is highlighted.



the Financial economic sector (black circles) and the Services sector (gray circles)
are indicated. An inspection of Fig. 3 shows that the stocks of the Financial
sector cluster around Royal Bank of Scotland (RBS) whereas stocks of the
Services sector are present in different branches of the MST. The MST also
gives additional information about the topology of the network. In fact,
it is evident from the figure that there are two stocks that behave as hubs.
One of them is RBS, which gathers 14 stocks, 10 of which belong to the
Financial sector. The other hub is SHEL, which gathers 10 stocks, among
which we find BG and BP.



3.4. Average Linkage Correlation Based Clustering

In this subsection we analyze the dendrogram of Fig. 4 obtained by
applying the ALCA to the correlation based distance matrix of the daily
returns. Once again, to provide representative examples we focus our
attention on the two sectors with the largest number of stocks. As in Fig. 2,
in panel a) of Fig. 4 the black lines identify stocks of the Financial
sector. It can be seen from the figure that most of the stocks (specifically,
16 out of 20) belonging to the Financial sector cluster together at a low
distance level ($d \approx 0.85$). Exceptions (referring to black lines outside the cluster
in panel a), from left to right) are Northern Rock (NRK), Royal & Sun
Alliance (RSA), Canary Wharf Group (WYFN) and Man Group (EMG).
Interestingly, RSA, WYFN and EMG are distant from the observed cluster
also when considering the SLCA, as shown in panel a) of Fig. 2 at positions
37, 69 and 89, respectively. In panel b) of Fig. 4 the black lines identify
the 19 stocks belonging to the Services sector. In this case just one
intra-sector cluster of 3 stocks is detected, specifically the one composed of
Vodafone Group (VOD), mmO2 (OOM) and British Telecom (BT-A), the
corresponding stock numbers in panel b) being 44, 45 and 46, respectively.

Fig. 3: MST obtained starting from the daily price returns of 92 highly traded
stocks belonging to the SET1 segment of the LSE. Only electronic transactions
that occurred in the year 2002 are considered. The Financial economic sector
(black) and the Services economic sector (gray) are highlighted. The existence of
two stocks that behave as hubs is evident. One of them is RBS, which gathers 14
stocks, 10 of which belong to the Financial sector. The other hub is SHEL, which
gathers 10 stocks.

Fig. 4: Dendrogram associated with the ALCA performed on daily returns of a 
portfolio of 92 stocks traded in the LSE in 2002. Panel a): The black lines 
identify stocks belonging to the Financial sector. Panel b): The black lines 
identify stocks belonging to the Services sector. 

Fig. 5: Contour plots of the original correlation matrix (panel a)) and of the one 
associated with the ultrametric distance (panel b)) obtained by using the SLCA for 
the daily price returns of 92 highly traded stocks belonging to the SET1 segment of 
the LSE. Only electronic transactions that occurred in year 2002 are considered. 
Here the stocks are identified by a numerical label ranging from 1 to 92 and ordered 
according to the hierarchical tree of Fig. 2. The figure gives a pictorial 
representation of the amount of information which is filtered out by using the SLCA. 

Fig. 6: Contour plots of the original correlation matrix (panel a)) and of the one 
associated with the ultrametric distance (panel b)) obtained by using the ALCA for 
the daily price returns of 92 highly traded stocks belonging to the SET1 segment of 
the LSE. Only electronic transactions that occurred in year 2002 are considered. 
Here the stocks are ordered according to the hierarchical tree of Fig. 4. The figure 
gives a pictorial representation of the amount of information which is filtered out 
by using the ALCA. 

A comparison of the results obtained by using the SLCA and the ALCA 
shows a substantial agreement between the outputs of these two methods. 
However, a refined comparison shows that the ALCA provides a more structured 
hierarchical tree. In Fig. 5 and Fig. 6 we show a graphical representation 
of the original correlation matrix in terms of a contour plot. 
In the contour plots the gray scale represents the values of the distances among 
stocks. The stocks are ordered according to the SLCA in Fig. 5 and according to 
the ALCA in Fig. 6. In both cases we also show the associated ultrametric 
matrices. A direct comparison of the ultrametric matrices confirms that 
the ALCA is more structured than the SLCA. Conversely, the SLCA selects 
elements of the matrix with correlation values greater than the ones selected 
by the ALCA, which are therefore less affected by statistical uncertainty. 
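
The difference between the ultrametric matrices of the two procedures can be illustrated on a toy example. The following is a minimal sketch and not the code used for the analysis: the function name `ultrametric_matrix` and the three-element distance matrix are hypothetical, and the naive agglomerative loop is only suitable for small inputs. Single linkage merges clusters at the minimum pairwise distance, average linkage at the mean pairwise distance.

```python
def ultrametric_matrix(dist, linkage="single"):
    """Return the ultrametric distance matrix of a naive agglomerative clustering.

    dist    : symmetric list of lists with the pairwise distances
    linkage : "single" (minimum distance) or "average" (mean distance)
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}      # cluster id -> member indices
    ultra = [[0.0] * n for _ in range(n)]
    next_id = n

    def cluster_distance(a, b):
        pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
        return min(pairs) if linkage == "single" else sum(pairs) / len(pairs)

    while len(clusters) > 1:
        ids = sorted(clusters)
        # pick the closest pair of clusters at this step
        height, a, b = min(
            (cluster_distance(a, b), a, b)
            for k, a in enumerate(ids)
            for b in ids[k + 1:]
        )
        # every pair split between the two merged clusters gets the merge height
        for i in clusters[a]:
            for j in clusters[b]:
                ultra[i][j] = ultra[j][i] = height
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return ultra


# hypothetical distances among three elements
toy = [[0.0, 1.0, 2.0],
       [1.0, 0.0, 3.0],
       [2.0, 3.0, 0.0]]
single = ultrametric_matrix(toy, "single")
average = ultrametric_matrix(toy, "average")
```

In this toy case both procedures first join elements 0 and 1 at distance 1. The third element then joins at the smallest distance to the pair (2) with single linkage and at the mean distance (2.5) with average linkage, which is the sense in which the SLCA retains the smaller distances, i.e. the larger correlations.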



3.5. The Planar Maximally Filtered Graph 



In this section we analyze the topological properties of the PMFG of Fig. 7, 
obtained from the distance matrix of daily returns of the stock portfolio. 
In the figure we again point out the behavior of stocks belonging to the 
Financial and Services sectors. From the figure we can observe that the 
Financial sector (black circles) is strongly intra-connected (black thicker 
edges) whereas for the sector of Services (gray circles) we find just a few 
intra-sector connections (gray thicker edges). These results agree with the 
ones observed with the SLCA and the ALCA. The advantage of the study of 
the PMFG is that, through it, we can perform a quantitative analysis of this 
behavior. The existence in the graph of completely connected subgraphs, 
specifically 3-cliques and 4-cliques, allows one to investigate the clustering 
level of sectors through a measure of the intra-cluster connection strength 
[25]. This measure is obtained by considering a specific sector $s$ composed 
of $n_s$ elements and indicating with $C_4$ and $C_3$ the number of 4-cliques 
and 3-cliques exclusively composed of elements of the sector. The connection 
strength $q_s$ of the sector $s$ is therefore defined as 

$$ q_s^4 = \frac{C_4}{n_s - 3}, \qquad q_s^3 = \frac{C_3}{3 n_s - 8}, $$

where we distinguish between the connection strength evaluated according 
to 4-cliques, $q_s^4$, and 3-cliques, $q_s^3$. The quantities $n_s - 3$ and 
$3 n_s - 8$ are normalizing factors. For large and strongly connected sectors 
both measures give almost the same result [25]. When small sectors are 
considered the quantity $q_s^3$ is more significant than $q_s^4$. Consider for 
instance a sector of 4 stocks. In this case $q_s^4$ can assume the value 0 or 1, 
whereas $q_s^3$ can assume one of the 5 values 0, 0.25, 0.5, 0.75 and 1, giving 
a measure of the clustering strength less affected by the quantization error. 
Note that in the case of $n_s = 4$, if $q_s^3$ assumes one of the values 0, 0.25, 
0.5 and 0.75 then $q_s^4$ is always zero. In Table 2 the connection strength is 
evaluated for all the sectors present in the portfolio. The Financial sector has 
$q_s^4 \approx 0.88$ and $q_s^3 \approx 0.92$. This last value is second only to 
that of the Energy sector (composed of 3 stocks), where all stocks are connected 
to each other so that $q_s^3 = 1$. The stocks belonging to the Energy sector are 
BG, BP and SHEL. We see in Fig. 7 that both BP and SHEL are characterized by 
high values of their degree (number of links with other elements). This fact 
implies that the Energy sector is strongly connected both within the sector and 
with other sectors. This behavior is different from what has been observed in 
the analysis of 100 highly capitalized stocks traded in the US equity market 
[25]. In Fig. 7 we observe that stocks of the Financial sector are strongly 
connected with stocks belonging to different sectors. In particular RBS is the 
center of the biggest star in the graph. On the contrary, the sector of Services 
is poorly intra-connected, with $q_s^4 = 0$ and $q_s^3 \approx 0.02$, and poorly 
connected to other sectors. In conclusion we observe two different behaviors. 
The Financial and Energy sectors are strongly intra-connected and strongly 
connected with other sectors. The sector of Services is poorly intra-connected 
and poorly interacting with other sectors. 
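
The two connection strength measures can be evaluated directly from the edge list of a graph. The sketch below is illustrative only: the function name `connection_strength`, the brute-force clique enumeration and the four-element example sector are assumptions made here, and the enumeration is practical only for sectors of modest size.

```python
from itertools import combinations


def connection_strength(edges, sector):
    """Return (q4, q3) for the nodes in `sector`, given the graph's edge list.

    q4 = C4 / (n_s - 3) and q3 = C3 / (3 n_s - 8), where C4 and C3 count the
    4-cliques and 3-cliques made only of sector members. A value is None when
    its normalizing factor is not positive.
    """
    links = {frozenset(e) for e in edges}
    members = sorted(set(sector))
    n_s = len(members)

    def is_clique(nodes):
        return all(frozenset(p) in links for p in combinations(nodes, 2))

    c3 = sum(is_clique(t) for t in combinations(members, 3))
    c4 = sum(is_clique(t) for t in combinations(members, 4))
    q4 = c4 / (n_s - 3) if n_s > 3 else None
    q3 = c3 / (3 * n_s - 8) if 3 * n_s - 8 > 0 else None
    return q4, q3


# hypothetical sector of 4 stocks, first fully connected, then with one link removed
full = list(combinations(["A", "B", "C", "D"], 2))
q_full = connection_strength(full, ["A", "B", "C", "D"])
q_partial = connection_strength(full[1:], ["A", "B", "C", "D"])
```

For the fully connected sector both measures equal 1. Removing a single link destroys the only 4-clique and two of the four 3-cliques, so the 4-clique measure drops to 0 while the 3-clique measure still records a value of 0.5, which is the quantization effect discussed above.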



4. Empirical Results: 5-minute data 
4.1. Random Matrix Theory 

Fig. 7: PMFG obtained from daily returns of a set of 92 stocks traded in the 
LSE in 2002. Black circles identify stocks belonging to the Financial sector. 
Gray circles identify stocks belonging to the Services sector. Other stocks 
are indicated by empty circles. Black thicker lines connect stocks belonging 
to the Financial sector. Gray thicker lines connect stocks belonging to the 
Services sector. 

The properties of the correlation matrix and of its eigenvalues and 
eigenvectors change dramatically when one considers cross correlations between 
returns computed at a 5-minute time horizon. The largest eigenvalue is 
$\lambda_1 = 11.2$ and this sets the variance of the space orthogonal to it to 
$\sigma^2 = 0.87$. The noisy region of the spectrum is characterized by the 
values $\lambda_{min} = 0.78$ and $\lambda_{max} = 0.99$. With these values one 
would conclude that 19 eigenvalues contain economic information. This is quite 
surprising because one would expect that for a short time horizon the correlation 
coefficients are less influenced by economic sectors than when one considers 
daily returns. We will see in the following sections that clustering methods 
support this view. 
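
The boundaries of the noisy region can be sketched with the standard random matrix expression for the spectrum of a correlation matrix of uncorrelated series. This is an illustration under stated assumptions and not the authors' computation: the function name `noise_band` is hypothetical, and the number of records passed in the example call is a placeholder, since the length of the 5-minute series is not stated in this section.

```python
from math import sqrt


def noise_band(sigma2, n_series, n_records):
    """Eigenvalue range expected for a random correlation matrix (RMT).

    sigma2    : variance of the part of the spectrum treated as noise
    n_series  : number of time series (stocks)
    n_records : number of records per series
    """
    q = n_records / n_series
    lam_min = sigma2 * (1 + 1 / q - 2 * sqrt(1 / q))
    lam_max = sigma2 * (1 + 1 / q + 2 * sqrt(1 / q))
    return lam_min, lam_max


# sigma2 and the number of stocks are taken from the text;
# 25000 records is a placeholder value, not the length of the data set
lam_min, lam_max = noise_band(0.87, 92, 25000)
```

Eigenvalues larger than the upper boundary are the ones regarded as carrying information not explained by noise. The width of the band shrinks as the number of records per series grows, which is why long 5-minute series give a narrow noisy region.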

Figure 8 shows the components $u_i$ of the first eigenvector. On the x axis 
of this figure the stocks are sorted in decreasing order according to the total 
number of trades recorded in the investigated period. The figure shows 
that the most heavily traded stocks have a larger component in the first 
eigenvector. This behavior is not observed in the first eigenvector for daily 
returns. 

A possible interpretation of this result is the following. Suppose that, 
as a first approximation, the dynamics of the set of stocks is described by 
a one factor model, i.e. a model in which the dynamics of each variable is 
Table 2: Intra-sector connection strength (daily returns) 

SECTOR                  $n_s$   $q_s^4 = C_4/[n_s - 3]$   $q_s^3 = C_3/[3 n_s - 8]$ 
Technology                4     0/1 = 0                   1/4 = 0.25 
Financial                20     15/17 ≈ 0.88              48/52 ≈ 0.92 
Energy                    3     -                         1/1 = 1 
Consumer non-Cyclical    12     2/9 ≈ 0.22                8/28 ≈ 0.29 
Consumer Cyclical        10     1/7 ≈ 0.14                5/22 ≈ 0.23 
Healthcare                6     0/3 = 0                   1/10 = 0.1 
Basic Materials           5     0/2 = 0                   3/7 ≈ 0.43 
Services                 19     0/16 = 0                  1/49 ≈ 0.02 
Utilities                 6     0/3 = 0                   0/10 = 0 
Capital Goods             5     0/2 = 0                   0/7 = 0 
Transportation            2     -                         - 

Fig. 8: Components $u_i$ of the first eigenvector of the correlation matrix of 
5-minute returns. On the x axis of this figure the stocks are sorted in decreasing 
order according to the total number of trades recorded in the investigated period. 
controlled by a single factor. The equation describing the one factor model 
is given by 

$$ r_i(t) = \gamma_i f(t) + \sqrt{1 - \gamma_i^2}\, \epsilon_i(t), \qquad (7) $$

where $\epsilon_i(t)$ is a Gaussian zero mean noise term with unit variance and 
it is assumed that the noise terms are uncorrelated with each other and 
with the factor, i.e. $\langle \epsilon_i(t) \epsilon_j(t) \rangle = \delta_{ij}$ 
and $\langle f(t) \epsilon_j(t) \rangle = 0$. The parameter $\gamma_i^2$ gives the 
fraction of variance explained by the common factor $f(t)$ and $1 - \gamma_i^2$ 
is the fraction left to the noise term. The model describes a system where $n$ 
variables are essentially controlled by a common factor describing a weighted 
mean. This type of model is, for example, consistent with the Capital Asset 
Pricing Model of stock market behavior. It is possible to show [39] that the 
spectrum of this model is given by a large eigenvalue 
$\lambda_1 \simeq \sum_{i=1}^{n} \gamma_i^2$ and $n - 1$ eigenvalues 
whose density can be obtained by using RMT. We wish to address here the 
question of the dependence of the first eigenvector on the $\gamma_i$ parameters. 
It is possible to show that in the large $n$ limit the first eigenvector is well 
approximated by a vector proportional to 
$g = (\gamma_1, \gamma_2, ..., \gamma_n)^T$. In fact the correlation matrix of 
the model of Eq. 7 has off diagonal elements $\rho_{ij} = \gamma_i \gamma_j$ for 
$i \neq j$. The product of the $i$-th row of the correlation matrix times the 
vector $g$ gives 
$\gamma_i (1 + \sum_{j \neq i} \gamma_j^2) \simeq \gamma_i \sum_{j=1}^{n} \gamma_j^2 \simeq \gamma_i \lambda_1$, 
which implies that $g$ well approximates the eigenvector in the large $n$ limit. 

Thus the result shown in Fig. 8 can be interpreted in the following way. 
At a 5-minute horizon the market is approximated by the one factor model of 
Eq. 7. The $\gamma_i$ are related to the trading frequencies because more actively 
traded stocks are usually the ones with the highest capitalization and these 
stocks are the ones following more closely the mean behavior of the market, 
i.e. the common factor $f(t)$. 
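
The large $n$ argument for the first eigenvector of the one factor model can be checked numerically. The sketch below assumes NumPy is available and uses hypothetical loadings $\gamma_i$ spread evenly between 0.2 and 0.5 for 92 series. These values are chosen only for illustration and are not estimated from the data.

```python
import numpy as np

n = 92
gamma = np.linspace(0.2, 0.5, n)          # hypothetical factor loadings

# correlation matrix implied by the one factor model:
# off diagonal elements gamma_i * gamma_j, unit diagonal
corr = np.outer(gamma, gamma)
np.fill_diagonal(corr, 1.0)

eigenvalues, eigenvectors = np.linalg.eigh(corr)   # eigenvalues in ascending order
lambda_1 = eigenvalues[-1]
u_1 = eigenvectors[:, -1]

# overlap between the first eigenvector and the normalized loading vector g
# (absolute value, because the sign of an eigenvector is arbitrary)
overlap = abs(u_1 @ gamma) / np.linalg.norm(gamma)
sum_gamma2 = float(np.sum(gamma ** 2))
```

With these loadings the overlap is very close to 1, and the largest eigenvalue exceeds the sum of the squared loadings by less than one unit, in line with the approximations used in the argument above.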

The sector analysis of the 5-minute correlation matrix performed with RMT 
shows less clear results than for daily returns. Figure 9 shows the 
contribution of Eq. 2 for the first 9 eigenvectors of the correlation matrix of 
5-minute returns of 92 LSE stocks. The first, second, fourth, sixth, and 
especially ninth eigenvectors show peaks indicating the prominent role of one 
or a few sectors in determining the dynamics of these eigenvectors. 

However, a systematic correspondence as in the case of daily returns 
is not observed. Moreover it is unclear what kind of information can be 
associated with the first 19 eigenvalues, which carry information not affected 
by statistical uncertainty. 

It is therefore worth considering what results are provided by correlation 
based clustering algorithms for the same time horizon. 
Fig. 9: Contribution of Eq. 2 for the first 9 eigenvectors of the correlation 
matrix of 5-minute returns of 92 LSE stocks. The order of sectors is the same as 
in Table 1. 



4.2. Single Linkage Correlation based Clustering 

At a 5-minute time horizon the structure of the MST and of the hierarchical 
tree is quite different from that of the analogous trees at a daily time horizon. 
Figure 10 shows the hierarchical tree obtained by using the SLCA for the 
selected 92 stocks at a 5-minute time horizon. We proceed here in analogy 
with the discussion for the one day time horizon in order to emphasize 
similarities and differences between the results obtained for the two time 
horizons. Specifically, in panel a) all stocks belonging to the Financial 
economic sector are highlighted, while in panel b) the stocks belonging to 
the Services economic sector are highlighted. The hierarchical tree shows 
that the mean level of correlation in the market is now lower than at the one 
day time horizon. The level of clustering is also less pronounced at this time 
horizon. In fact, panel a) of Fig. 10 shows that the stocks of the Financial 
sector are only poorly clustered, contrary to the case shown in panel a) of 
Fig. 2. Panel b) of Fig. 10 shows that at a 5-minute time horizon there is 
no clustering at all for stocks of the Services sector. 

Fig. 10: Hierarchical tree obtained by using the SLCA starting from the 5-minute 
price returns of 92 highly traded stocks belonging to the SET1 segment of the LSE. 
Only electronic transactions that occurred in year 2002 are considered. In panel a) 
the Financial economic sector is highlighted. In panel b) the Services economic 
sector is highlighted. 

In Fig. 11 the MST of the 92 stocks computed at a 5-minute time 
horizon is shown. As in Fig. 3, the stocks belonging to the Financial sector 
(black circles) and Services sector (gray circles) are highlighted. Several 
stocks of the Financial sector cluster around RBS. The organization of the 
92 stocks around two hubs (SHEL and RBS) is here more pronounced than 
at a daily time horizon. In particular, RBS now has a degree of 29 and 
SHEL has a degree of 17. However, while at a daily time horizon 10 stocks 
of the Financial sector are linked to RBS, at the present time horizon only 
7 Financial stocks are linked to RBS. A possible interpretation is that RBS 
acts as a hub mainly for its economic sector at a daily time horizon, while at 
a shorter time horizon, when economic sectors are expected to play a minor 
role, RBS is influential for the whole stock market. These results are similar 
to what has been observed for 100 stocks traded in US equity markets in 
Ref. (HI). 
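
The role of a hub in the MST can be made concrete with a small example. The sketch below is not the authors' implementation: it assumes the usual correlation based distance $d_{ij} = \sqrt{2(1 - \rho_{ij})}$, builds the tree with Kruskal's algorithm, and uses a hypothetical 4 x 4 correlation matrix. The function names `minimum_spanning_tree` and `degrees` are introduced here for illustration.

```python
from math import sqrt


def minimum_spanning_tree(corr):
    """Kruskal's algorithm on the distances d_ij = sqrt(2 (1 - rho_ij)).

    Returns the list of MST edges (i, j) for a symmetric correlation matrix
    given as a list of lists.
    """
    n = len(corr)
    edges = sorted(
        (sqrt(2.0 * (1.0 - corr[i][j])), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                         # the edge joins two components
            parent[ri] = rj
            tree.append((i, j))
    return tree


def degrees(tree, n):
    """Number of MST links attached to each of the n elements."""
    deg = [0] * n
    for i, j in tree:
        deg[i] += 1
        deg[j] += 1
    return deg


# hypothetical matrix: element 0 is strongly correlated with all the others
toy = [[1.0, 0.8, 0.8, 0.8],
       [0.8, 1.0, 0.1, 0.1],
       [0.8, 0.1, 1.0, 0.1],
       [0.8, 0.1, 0.1, 1.0]]
tree = minimum_spanning_tree(toy)
hub_degrees = degrees(tree, len(toy))
```

In this toy case every other element attaches directly to element 0, which therefore has degree 3 and plays the role of a hub, in the same sense in which RBS and SHEL are hubs of the MST discussed above.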

4.3. Average Linkage Correlation based Clustering 

In Fig. 12 we show the dendrogram obtained for the 5-minute returns by 
applying the ALCA to the correlation based distance matrix of the system. 
In panel a) of Fig. 12 the black lines again identify the Financial 
sector. In the figure, we observe that just one intra-sector cluster of 3 
elements is formed. Specifically, Lloyds TSB Group (LLOY), RBS and HSBC 
Holdings (HSBA) cluster together at a distance level $d \approx 1.08$. In panel b) 
the black lines identify stocks belonging to the Services sector. As in 
the case of daily returns, only one intra-sector cluster of 3 stocks is recognized 
by the ALCA. It involves the stocks Dixons Group (DXNS), Boots (BOG) and 
Compass Group (CPG). 

A direct comparison of Fig. 4 and Fig. 12 shows that at the 5-minute time 
horizon the Financial cluster observed for daily returns is not yet 
formed. More generally, a strong reduction of structures in the dendrogram 
is observed when going from daily returns to 5-minute returns. 







Fig. 11: MST obtained starting from the 5-minute price returns of 92 highly traded 
stocks belonging to the SET1 segment of the LSE. Only electronic transactions 
that occurred in year 2002 are considered. The Financial economic sector (black) 
and the Services economic sector (gray) are highlighted. The existence of 
two stocks that behave as hubs is evident. One of them is RBS, which gathers 
29 stocks, 7 of which belong to the Financial sector. The other hub is SHEL, 
which gathers 17 stocks. 



4.4. The Planar Maximally Filtered Graph 

Lastly we discuss the properties of the PMFG obtained for the portfolio 
of stocks by considering 5-minute returns. A comparison of Fig. 7 and Fig. 
13 shows that the PMFG experiences a major modification. In fact, if we just 
focus on the stocks with the highest values of degree, some of them increase 
their degree whereas others decrease it. Specifically, RBS and SHEL 
increase their degree from 42 to 62 and from 24 to 37 respectively, whereas 
BP and Amvescap (AVZ) decrease their degree from 23 to 18 and from 24 to 
5 respectively. This difference shows that the role of the most connected stocks 
can be quite different at different time horizons. 

Fig. 12: Dendrogram associated with the ALCA performed on 5-minute returns of 
a portfolio of 92 stocks traded in the LSE in 2002. Panel a): The black lines 
identify stocks belonging to the Financial sector. Panel b): The black lines 
identify stocks belonging to the Services sector. 

In Table 3 the intra-sector connection strength discussed in Section 3.5 is 
evaluated for the 5-minute time horizon. Table 3 shows that only three 
sectors have a connection strength different from zero. Specifically, the Energy 
sector has connection strength $q_s^3 = 1$ and the Financial sector has 
$q_s^4 \approx 0.71$ and $q_s^3 \approx 0.75$, indicating a behavior of both 
sectors similar to the one observed for daily returns. Finally, the sector of 
Services has $q_s^3 \approx 0.06$, revealing a clustering of the same order as 
the one observed for daily returns. A critical difference between the two time 
horizons appears for several sectors, the most striking example being the sector 
of Basic Materials. In Table 3 we see that the connection strength of this sector 
is zero with respect to both connection strength measures. On the contrary, when 
daily returns are considered a connection strength $q_s^3 \approx 0.43$ was 
observed. This difference suggests that the intra-sector correlation of Basic 
Materials stocks needs time to become established in the market. Several of the 
remaining sectors show a behavior analogous to that of Basic Materials. This 
effect is detected by all the considered techniques, thus indicating that the 
market needs time to assess a certain degree of correlation among stocks. 
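
The comparison between the two time horizons can be reproduced from the clique counts reported in Table 2 and Table 3. The snippet below is a small arithmetic sketch: the dictionary layout and the names `q3`, `strength_table`, `still_connected` and `lost` are choices made here, and the blank table cells are entered as zero counts.

```python
# (n_s, C4, C3) as read from Table 2 (daily) and Table 3 (5-minute);
# blank cells of the tables are entered as zero counts
daily = {
    "Technology": (4, 0, 1), "Financial": (20, 15, 48), "Energy": (3, 0, 1),
    "Consumer non-Cyclical": (12, 2, 8), "Consumer Cyclical": (10, 1, 5),
    "Healthcare": (6, 0, 1), "Basic Materials": (5, 0, 3),
    "Services": (19, 0, 1), "Utilities": (6, 0, 0),
    "Capital Goods": (5, 0, 0), "Transportation": (2, 0, 0),
}
five_minute = {
    "Technology": (4, 0, 0), "Financial": (20, 12, 39), "Energy": (3, 0, 1),
    "Consumer non-Cyclical": (12, 0, 0), "Consumer Cyclical": (10, 0, 0),
    "Healthcare": (6, 0, 0), "Basic Materials": (5, 0, 0),
    "Services": (19, 0, 3), "Utilities": (6, 0, 0),
    "Capital Goods": (5, 0, 0), "Transportation": (2, 0, 0),
}


def q3(n_s, c3):
    """Connection strength from 3-cliques; None when 3 n_s - 8 is not positive."""
    denominator = 3 * n_s - 8
    return c3 / denominator if denominator > 0 else None


def strength_table(counts):
    return {sector: q3(n_s, c3) for sector, (n_s, _c4, c3) in counts.items()}


q3_daily = strength_table(daily)
q3_5min = strength_table(five_minute)

# sectors that keep a nonzero q3 at the 5-minute horizon
still_connected = sorted(s for s, q in q3_5min.items() if q)
# sectors whose q3 is positive for daily returns but zero at 5 minutes
lost = sorted(s for s in daily if q3_daily[s] and q3_5min[s] == 0)
```

Here `still_connected` contains the Energy, Financial and Services sectors, while `lost` collects the sectors, Basic Materials among them, whose 3-clique connection strength is positive for daily returns and vanishes at the 5-minute horizon.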



5. Conclusions 

All the methods considered in the present paper are able to detect 
information about economic sectors of stocks starting from the synchronous 
correlation coefficient matrix of return time series. The degree of efficiency 
in the detection depends on the return time horizon. Specifically, the 
system is more hierarchically structured at daily time horizons, confirming 
that the market needs a finite amount of time to assess the correct degree 
Fig. 13: PMFG obtained from 5-minute returns of a set of 92 stocks traded in the 
LSE in 2002. Black circles identify stocks belonging to the Financial sector. 
Gray circles identify stocks belonging to the Services sector. Other stocks 
are indicated by empty circles. Black thicker lines connect stocks belonging 
to the Financial sector. Gray thicker lines connect stocks belonging to the 
Services sector. 

of cross correlation between pairs of stocks whose prices are simultaneously 
recorded ([Tgl). Our comparative study shows that, at a given time horizon, 
the considered methods can provide different information about the 
system. For example, at the one day time horizon the method based on RMT 
predominantly associates the eigenvectors of the six highest eigenvalues, which 
are not affected by statistical uncertainty, respectively with the market factor 
(first eigenvalue and eigenvector), the Consumer non-Cyclical sector (second 
eigenvalue), the Financial sector (third eigenvalue), a linear combination of 
the Technology and Capital Goods sectors (fourth and fifth eigenvalues) and 
the Healthcare sector (sixth eigenvalue). In the present case, RMT does not 
provide information about the existence and strength of economic relations 
between stocks belonging to the sectors of Energy, Consumer Cyclical, Basic 
Materials, Services, Utilities and Transportation. A detailed investigation 
of the hierarchical trees obtained by the SLCA and ALCA shows that these 
methods are able to detect efficiently most of the clusters detected with the 
methods of RMT and also other clusters related to other sectors. The 
Table 3: Intra-sector connection strength (5-minute returns) 

SECTOR                  $n_s$   $q_s^4 = C_4/[n_s - 3]$   $q_s^3 = C_3/[3 n_s - 8]$ 
Technology                4     0/1 = 0                   0/4 = 0 
Financial                20     12/17 ≈ 0.71              39/52 ≈ 0.75 
Energy                    3     -                         1/1 = 1 
Consumer non-Cyclical    12     0/9 = 0                   0/28 = 0 
Consumer Cyclical        10     0/7 = 0                   0/22 = 0 
Healthcare                6     0/3 = 0                   0/10 = 0 
Basic Materials           5     0/2 = 0                   0/7 = 0 
Services                 19     0/16 = 0                  3/49 ≈ 0.06 
Utilities                 6     0/3 = 0                   0/10 = 0 
Capital Goods             5     0/2 = 0                   0/7 = 0 
Transportation            2     -                         - 







only sector that RMT detects more efficiently than the correlation based 
clustering procedures is the cluster of stocks belonging to the Consumer 
non-Cyclical sector. One sector which is detected poorly by both RMT and 
hierarchical clustering methods is the sector of Services. RMT is not able to 
detect it, whereas SLCA and ALCA are able to detect only a limited aggregation 
of its elements. 

Our comparative analysis of the hierarchical clustering methods shows 
that SLCA and ALCA also provide different information. Specifically, the 
SLCA provides information about the highest level of correlation of 
the correlation matrix whereas the ALCA averages this information within 
each considered cluster. In this way the average linkage clustering is able to 
provide more structured information about the hierarchical organization 
of the stocks of a portfolio. 

Additional information with respect to that associated with the MST 
of the system can also be detected by considering the properties of the 
PMFG. This graph provides quantitative information about the degree of 
inter-cluster and intra-cluster connection of the various elements. 

In summary, we believe that our empirical comparison of different methods 
provides evidence that RMT and hierarchical clustering methods are 
able to point out information present in the correlation matrix of the 
investigated system. The information that is detected with these methods is in 
part overlapping but in part specific to the selected method. 
In short, all the approaches detect information but not exactly the same 
information. For this reason an approach that simultaneously makes use of several 
of these methods may provide a better characterization of the investigated 
system than an approach based on just one of them. 
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